Search for: All records

Creators/Authors contains: "Wang"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Free, publicly-accessible full text available December 18, 2027
  2. Free, publicly-accessible full text available May 18, 2027
  3. Modern model hubs, such as Hugging Face, store tens of petabytes of LLMs, with fine-tuned variants vastly outnumbering base models and dominating storage consumption. Existing storage reduction techniques, such as deduplication and compression, are either LLM-oblivious or incompatible with each other, limiting data reduction effectiveness. Our large-scale characterization study across all publicly available Hugging Face LLM repositories reveals several key insights: (1) fine-tuned models within the same family exhibit highly structured, sparse parameter differences suitable for delta compression; (2) bitwise similarity enables LLM family clustering; and (3) tensor-level deduplication is better aligned with model storage workloads, achieving high data reduction with low metadata overhead. Building on these insights, we design BitX, an effective, fast, lossless delta compression algorithm that compresses the XORed difference between fine-tuned and base LLMs. We build ZipLLM, a model storage reduction pipeline that unifies tensor-level deduplication and lossless BitX compression. By synergizing deduplication and compression around LLM family clustering, ZipLLM reduces model storage consumption by 54%, over 20% higher than state-of-the-art deduplication and compression approaches.
    Free, publicly-accessible full text available May 4, 2027
  4. Free, publicly-accessible full text available March 31, 2027
  5. Interactive notebook programming is universal in modern ML and AI workflows, with interactive deep learning training (IDLT) emerging as a dominant use case. To ensure responsiveness, platforms like Jupyter and Colab reserve GPUs for long-running notebook sessions, despite their sporadic GPU usage, leading to extremely low GPU utilization and prohibitively high costs. In this paper, we introduce NotebookOS, a GPU-efficient notebook platform tailored for the unique requirements of IDLT. NotebookOS employs replicated notebook kernels with Raft-synchronized replicas distributed across GPU servers. To optimize GPU utilization, NotebookOS oversubscribes server resources, leveraging high inter-arrival times in IDLT workloads, and allocates GPUs only during active cell execution. It also supports replica migration and automatic cluster scaling under high load. Altogether, this design enables interactive training with minimal delay. In evaluations on production workloads, NotebookOS saved over 1,187 GPU hours in 17.5 hours of real-world IDLT, while significantly improving interactivity.
    Free, publicly-accessible full text available March 22, 2027
  6. Free, publicly-accessible full text available March 1, 2027
  7. We study optimal pricing in a single-server queueing system that can be observable or unobservable, depending on how customers receive information to estimate sojourn time. Our primary objective is to determine whether the service provider is better off making the system observable or unobservable under optimal pricing. We formulate the optimal pricing problem using Markov decision process (MDP) models for both observable and unobservable systems. For unobservable systems, the problem is studied using an MDP with a fixed-point equation as an equilibrium constraint. We show that the MDPs for both observable and unobservable queues are special cases of a generalized arrivals-based MDP model, in which the optimal arrival rate (rather than price) is set in each state. Then, we show that the optimal policy that solves the generalized MDP exhibits a monotone structure, in that the optimal arrival rate is non-increasing in the queue length, which allows for developing efficient algorithms to determine optimal pricing policies. Next, we show that if no customers overestimate sojourn time in the observable system, it is in the interest of the service provider to make the system observable. We also show that if all customers overestimate sojourn time, the service provider is better off making the system unobservable. Lastly, numerical results indicate that when customers are heterogeneous in estimating their sojourn time, the service provider can expect a higher gain by making the system observable if, on average, customers do not significantly overestimate sojourn time.
    Free, publicly-accessible full text available March 1, 2027
  8. Free, publicly-accessible full text available March 1, 2027
  9. Free, publicly-accessible full text available February 1, 2027
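The XOR-then-compress delta idea behind BitX (item 3) can be sketched as a toy. This is an illustrative sketch only, assuming float32 tensors and using zlib as a stand-in lossless compressor; the paper's actual encoder, formats, and compressor may differ. Because a fine-tuned tensor differs only slightly from its base, the bitwise XOR is dominated by zero bits (shared sign, exponent, and high mantissa bits), which a generic lossless compressor shrinks effectively:

```python
import zlib
import numpy as np

def bitx_delta(base: np.ndarray, finetuned: np.ndarray) -> bytes:
    """XOR the raw bit patterns of two same-shape float32 tensors, then
    losslessly compress the result."""
    xor = base.view(np.uint32) ^ finetuned.view(np.uint32)
    return zlib.compress(xor.tobytes(), level=9)

def bitx_restore(base: np.ndarray, blob: bytes) -> np.ndarray:
    """Invert bitx_delta: decompress, XOR against the base, reinterpret."""
    xor = np.frombuffer(zlib.decompress(blob), dtype=np.uint32)
    return (base.view(np.uint32) ^ xor).view(np.float32)

# Synthetic "fine-tune": small perturbations of a random base tensor.
rng = np.random.default_rng(0)
base = rng.standard_normal(100_000).astype(np.float32)
finetuned = base + (rng.standard_normal(100_000) * 1e-4).astype(np.float32)

blob = bitx_delta(base, finetuned)
restored = bitx_restore(base, blob)
assert np.array_equal(restored, finetuned)   # XOR is exactly invertible
assert len(blob) < finetuned.nbytes          # delta compresses below raw size
```

Because XOR is its own inverse, the round trip is bit-exact (lossless) regardless of the compressor used for the XORed stream.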
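The utilization argument behind NotebookOS (item 5), that binding a GPU only during active cell execution needs far fewer GPUs than reserving one per session, can be illustrated on a toy trace. All session timings below are hypothetical; this is a sweep-line sketch of the allocation accounting, not the system's scheduler:

```python
# Each session runs cells only intermittently: session -> [(start, end), ...]
sessions = {
    "s1": [(0, 2), (10, 12)],
    "s2": [(1, 3), (20, 21)],
    "s3": [(5, 6), (15, 18)],
    "s4": [(7, 8), (16, 17)],
}

# Sweep-line over cell executions: +1 GPU at cell start, -1 at cell end.
events = []
for runs in sessions.values():
    for start, end in runs:
        events.append((start, +1))
        events.append((end, -1))
events.sort()  # at equal times, -1 sorts before +1, releasing before acquiring

peak = cur = 0
for _, delta in events:
    cur += delta
    peak = max(peak, cur)

# Per-session reservation needs one GPU per session for the whole trace;
# execution-time binding needs only the peak number of concurrent cells.
print(f"reserved-per-session GPUs: {len(sessions)}; on-demand peak: {peak}")
```

On this trace, reservation needs 4 GPUs while execution-time binding peaks at 2, which is the gap oversubscription exploits when inter-arrival times between cell executions are high.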